Data Set
The data set consists of 7,043 observations of 21 variables as shown below:
- customerID: Unique identifier for each customer
- gender: Gender of customer
- SeniorCitizen: A binary variable which identifies if a customer is a Senior Citizen (1) or not (0)
- Partner: A binary variable which identifies if a customer has a partner (Yes/No)
- Dependents: A binary variable which identifies if a customer has any dependents (Yes/No)
- tenure: The tenure of association of the customer with the company in months
- PhoneService: A binary variable which identifies if a customer has phone service
- MultipleLines: Variable that identfies if customers having phone lines have multiple lines or not
- InternetService: This variable identifies the type of internet service that the customer has
- OnlineSecurity: Variable that identfies if customers having internet service have online security or not
- Onlinebackup: Variable that identfies if customers having internet service have online backup or not
- DeviceProtection: Variable that identfies if customers having internet service have device protection or not
- TechSupport: Variable that identfies if customers having internet service have availed tech support or not
- StreamingTV: Variable that identfies if customers having internet service have TV Streaming or not
- StreamingMovies: Variable that identfies if customers having internet service have Movie Streaming or not
- Contract: Variable that identfies the contract type of the customers
- PaperlessBilling: Variable that identfies if the customer has opted for paperless billing
- PaymentMethod: Variable that identfies the payment method of the customer
- MonthlyCharges: Variable that identfies the monthly bill amount of the customer
- TotalCharges: Variable that identfies the overall bill amount of the customer throughout the tenure
- Churn: Variable that identfies if the customer has cancelled the company’s services
Let’s look at the distinct values present in each categorical variable.
Some factor variables have 3 types, let’s see what those values are, and if we can group them in anyway.
options(width = 10)
telco_customer_churn_raw_data %>%
summarise_all(funs(n_distinct(.)))
as.data.frame(table(telco_customer_churn_raw_data$MultipleLines))
as.data.frame(table(telco_customer_churn_raw_data$InternetService))
as.data.frame(table(telco_customer_churn_raw_data$OnlineSecurity))
as.data.frame(table(telco_customer_churn_raw_data$OnlineBackup))
as.data.frame(table(telco_customer_churn_raw_data$DeviceProtection))
as.data.frame(table(telco_customer_churn_raw_data$TechSupport))
as.data.frame(table(telco_customer_churn_raw_data$StreamingMovies))
as.data.frame(table(telco_customer_churn_raw_data$StreamingTV))
As we can see, some of the factors have 3 values, but the categories “No Internet Service”/“No Phone Service” are not adding any additional value as that information is already captured in a separate variable. So, let’s update these two values to “No”.
Also, since all categorical variables are in yes/no format except Senior Citizen, we can transform that also into Yes/No format. Let’s also transform the output variable Churn into 1/0 format so that we do not have to dummify it further.
#Transform MultipleLines Variable
telco_customer_churn_transformed_data <- telco_customer_churn_raw_data %>%
mutate(MultipleLines = ifelse(MultipleLines=="No phone service","No",MultipleLines))
#Transform OnlineSecurity Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
mutate(OnlineSecurity = ifelse(OnlineSecurity=="No internet service","No",OnlineSecurity))
#Transform OnlilneBackup Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
mutate(OnlineBackup = ifelse(OnlineBackup=="No internet service","No",OnlineBackup))
#Transform DeviceProtection Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
mutate(DeviceProtection = ifelse(DeviceProtection=="No internet service","No",DeviceProtection))
#Transform TechSupport Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
mutate(TechSupport = ifelse(TechSupport=="No internet service","No",TechSupport))
#Transform Streaming Movies Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
mutate(StreamingMovies = ifelse(StreamingMovies=="No internet service","No",StreamingMovies))
#Transform Streaming TV Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
mutate(StreamingTV = ifelse(StreamingTV=="No internet service","No",StreamingTV))
#Transform Senior Citizen Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
mutate(SeniorCitizen = ifelse(SeniorCitizen==1,"Yes","No"))
#Transform Churn Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
mutate(Churn = ifelse(Churn=="Yes",1,0))
Let’s look at the structure of the dataset.
telco_customer_churn_transformed_data %>%
plot_str()
From the above graph we can notice that some categorical variables are currently ‘characters’. We can convert them to factors for more accurate representation.
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
mutate(SeniorCitizen = as.factor(SeniorCitizen),
gender = as.factor(gender),
Partner = as.factor(Partner),
Dependents = as.factor(Dependents),
SeniorCitizen = as.factor(SeniorCitizen),
PhoneService = as.factor(PhoneService),
MultipleLines = as.factor(MultipleLines),
InternetService = as.factor(InternetService),
OnlineSecurity = as.factor(OnlineSecurity),
OnlineBackup = as.factor(OnlineBackup),
DeviceProtection = as.factor(DeviceProtection),
TechSupport = as.factor(TechSupport),
StreamingTV = as.factor(StreamingTV),
StreamingMovies = as.factor(StreamingMovies),
PaperlessBilling = as.factor(PaperlessBilling),
Contract = as.factor(Contract),
PaymentMethod = as.factor(PaymentMethod),
Churn = as.factor(Churn))
Data Exploration
As the graph below shows, 99.8% of the observations are complete.
telco_customer_churn_transformed_data %>%
plot_intro()

The below graph shows us that “TotalCharges” has missing values. Though they are very few, we will check the reason behind this in further sections.

From the below count we can see that there are no duplicate Customer IDs in the dataset. There are 7043 records and 7043 unique Customer IDs
telco_customer_churn_transformed_data %>%
summarise(Total_Num_Records = n(),Unique_CustomerIDs = n_distinct(customerID))
From the below frequency distributions we can see that there are no duplicate values in any categorical variable. The data appears to be clean.
#Customer ID related information has already been check above. It is not required here.
telco_customer_churn_transformed_data %>%
select(-customerID) %>%
plot_bar(ggtheme = theme_light(), title = "Categorical Variable Distributions")


From the below histograms we can see that there are some customers with a tenure of 0 months. We assume that these are new customers, added in the past month.
Also, notice that Total Charges is skewed to the right, which makes sense as we have customers with high tenure. The total charges for them will have accumulated for a long period.
telco_customer_churn_transformed_data %>%
plot_histogram(ggtheme = theme_light())

The missing values of “TotalCharges” that we noticed in earlier sections is because of the new customers (tenure=0). Since they have not completed a billing cycle yet, their Total Charge value is missing. We can confirm this with the result below, which checks if there are any missing TotalCharges values for observations with non-zero tenures. As we can see there are none. So, we can conclude that missing TotalCharges are because of 0 tenure.
telco_customer_churn_transformed_data %>%
filter(is.na(TotalCharges) && tenure!=0)
Let’s convert missing TotalCharges to 0 and plot missing values again.
#Check if TotalCharges is NA and replace with 0
telco_customer_churn_cleaned_data <- telco_customer_churn_transformed_data %>%
mutate(TotalCharges = ifelse(is.na(TotalCharges),0,TotalCharges))
telco_customer_churn_cleaned_data %>%
plot_missing()

Before proceeding to modeling, let’s check if there are any trends in the data.
The overall categorical variables in this data set can be classified as follows:
- variables that define attributes of a customer
- variables that define attributes of the service
Let’s check how these variables affect Churn by plotting them
Customer Atrributes
From the graphs below we can notice that:
- Gender does not have a clear trend for Churn; the gender split of churned customers is nearly 50-50.
- Senior Citizens and customers with dependents seems to have lower churn, we can check the effect of these variable on churn further.
#using plotly here just to explore the functinalities in this package more
#Plot by SeniorCitizen
telco_customer_churn_cleaned_data %>%
group_by(Churn,SeniorCitizen) %>%
summarise(count = n()) %>%
plot_ly(x= ~SeniorCitizen, y= ~count, name= ~Churn, type = 'bar', width = 400, height = 400) %>%
layout(title = 'Senior Citizen distribution based on Churn')
#Plot by Gender
telco_customer_churn_cleaned_data %>%
group_by(Churn,gender) %>%
summarise(count = n()) %>%
plot_ly(x= ~gender, y= ~count, name= ~Churn, type = 'bar', width = 400, height = 400) %>%
layout(title = 'Gender distribution based on Churn')
#Plot by Dependents
telco_customer_churn_cleaned_data %>%
group_by(Churn,Dependents) %>%
summarise(count = n()) %>%
plot_ly(x= ~Dependents, y= ~count, name= ~Churn, type = 'bar', width = 400, height = 400) %>%
layout(title = 'Dependent distribution based on Churn')
#Plot by Partner
telco_customer_churn_cleaned_data %>%
group_by(Churn,Partner) %>%
summarise(count = n()) %>%
plot_ly(x= ~Partner, y= ~count, name= ~Churn, type = 'bar', width = 400, height = 400) %>%
layout(title = 'Partner distribution based on Churn')
Service Atrributes
From the plots below we can infer the following:
- Customers with Fiber Optic Internet Sevice seems to churn more
- Customer with Month-to-month contracts seems to churn more
- Customers who pay their bills through Electronic Check seem to churn more
- Customers who have opted for paperless bills seem to churn more
#using ggplot here just to explore the functinalities in this package more
#Plot by SeniorCitizen
p1 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,PhoneService) %>%
summarise(count = n()) %>%
ggplot(aes(PhoneService,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("PhoneService distribution") +
labs(
x = "Phone Service",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by Gender
p2 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,MultipleLines) %>%
summarise(count = n()) %>%
ggplot(aes(MultipleLines,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("MultipleLines distribution") +
labs(
x = "Multiple Lines",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by Dependents
p3 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,InternetService) %>%
summarise(count = n()) %>%
ggplot(aes(InternetService,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("InternetService distribution") +
labs(
x = "Internet Service",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by OnlineSecurity
p4 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,OnlineSecurity) %>%
summarise(count = n()) %>%
ggplot(aes(OnlineSecurity,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("OnlineSecurity distribution") +
labs(
x = "Online Security",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by OnlineBackup
p5 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,OnlineBackup) %>%
summarise(count = n()) %>%
ggplot(aes(OnlineBackup,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("OnlineBackup distribution") +
labs(
x = "Online Backup",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by DeviceProtection
p6 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,DeviceProtection) %>%
summarise(count = n()) %>%
ggplot(aes(DeviceProtection,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("DeviceProtection distribution") +
labs(
x = "Device Protection",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by TechSupport
p7 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,TechSupport) %>%
summarise(count = n()) %>%
ggplot(aes(TechSupport,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("TechSupport distribution") +
labs(
x = "Tech Support",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by StreamingTV
p8 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,StreamingTV) %>%
summarise(count = n()) %>%
ggplot(aes(StreamingTV,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("StreamingTV distribution") +
labs(
x = "StreamingTV",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by StreamingMovies
p9 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,StreamingMovies) %>%
summarise(count = n()) %>%
ggplot(aes(StreamingMovies,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("StreamingMovies distribution") +
labs(
x = "Streaming Movies",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by PaperlessBilling
p10 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,PaperlessBilling) %>%
summarise(count = n()) %>%
ggplot(aes(PaperlessBilling,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("PaperlessBilling distribution") +
labs(
x = "Paperless Billing",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by Contract
p11 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,Contract) %>%
summarise(count = n()) %>%
ggplot(aes(Contract,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("Contract Type distribution") +
labs(
x = "Contract Type",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
#Plot by PaymentMethod
p12 <- telco_customer_churn_cleaned_data %>%
group_by(Churn,PaymentMethod) %>%
summarise(count = n()) %>%
ggplot(aes(PaymentMethod,count,fill = Churn))+
geom_bar(stat = "identity", position = "dodge")+
coord_flip()+
ggtitle("PaymentMethod distribution") +
labs(
x = "Payment Method",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.background = element_rect(
fill = "white",
colour = "lightgrey",
size = 0.75),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
)
grid.arrange(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12)

Other Attributes
Let’s look at the following variables in this section:
- Tenure
- Monthly Charges
- Overall Charges
We can notice that customers who churned were predominantly from lower tenure groups. So, this could be an important variable for our modeling. Overallcharges also has the same distribution with respect to churn, but that could be because it has strong correlation with tenure (OverallCharges = Tenure x MonthlyCharges).
#Tenure
telco_customer_churn_cleaned_data %>%
plot_ly(x= ~tenure, name = ~Churn, type = 'histogram', showlegend = T, width = 500, height = 400) %>%
layout(title = 'Tenure distribution by Churn')
#Monthly Charges
telco_customer_churn_cleaned_data %>%
plot_ly(x= ~MonthlyCharges, name = ~Churn, type = 'histogram', showlegend = T, width = 500, height = 400) %>%
layout(title = 'Monthly Charges distribution by Churn')
#Overall Charges
telco_customer_churn_cleaned_data %>%
plot_ly(x= ~TotalCharges, name = ~Churn, type = 'histogram', showlegend = T, width = 500, height = 400) %>%
layout(title = 'Total Charges distribution by Churn')
By plotting Tenure against MonthlyCharges, we can notice in the below graph that churned customers are mostly concentrated in the low tenure + high monthly charges region. These two variables could be useful in our model.
telco_customer_churn_cleaned_data %>%
plot_ly(x = ~tenure, y = ~MonthlyCharges, name = ~Churn, width = 1000, height = 500) %>%
layout(title = 'Monthly Charges vs Tenure by Churn')
By plotting Tenure against TotalCharges, we can notice in the below graph that churned customers are mostly concentrated in the low tenure + low TotalCharges region.
telco_customer_churn_cleaned_data %>%
plot_ly(x = ~tenure, y = ~TotalCharges, name = ~Churn, width = 1000, height = 500) %>%
layout(title = 'Total Charges vs Tenure by Churn')
Plotting Monthly Charges against TotalCharges may not be very useful, but let’s see if the plot can give us some insight. As we can see churned customers are in the low-side of TotalCharges (these could be customers with low tenures).
telco_customer_churn_cleaned_data %>%
plot_ly(x = ~MonthlyCharges, y = ~TotalCharges, name = ~Churn, width = 1000, height = 500) %>%
layout(title = 'Monthly Charges vs Total Charges by Churn')
Let’s plot the distribution of categorical variables specifically in churned customers to check if any variable stands out.
Customer Attributes distribution on churned customers
We saw earlier that Senior Citizen and Dependent variables could have correlation with Churn, let’s see what is the distribution of these variables in churned customers. As we can see, about 25% of the churned customers are senior citizens and 17.4 percent are those with dependents. So, these attributes may not be as useful as we thought they are afterall.
telco_customer_churn_cleaned_data %>%
filter(Churn==1) %>%
plot_ly(labels = ~SeniorCitizen, values = ~Churn, type = "pie", width = 500, height = 400) %>%
layout(title = 'SeniorCitizens distribution in Churned Customers')
telco_customer_churn_cleaned_data %>%
filter(Churn==1) %>%
plot_ly(labels = ~Dependents, values = ~Churn, type = "pie", width = 600, height = 400) %>%
layout(title = 'Dependent distribution in churned customers')
Service Attributes distribution on churned customers
We saw earlier that Payment Method, Contract Type, Paperless Billing and Internet Service Type attributes could have correlation with Churn, let’s see what is the distribution of these variables in churned customers.
As we can notice from the graphs below, all the 4 variables have a specific category that contributes to more than 50% of churned customers. These variables could be very useful for our models.
telco_customer_churn_cleaned_data %>%
filter(Churn==1) %>%
plot_ly(labels = ~PaymentMethod, values = ~Churn, type = "pie", width = 600, height = 400) %>%
layout(title = 'Payment Method distribution in Churned Customers')
telco_customer_churn_cleaned_data %>%
filter(Churn==1) %>%
plot_ly(labels = ~Contract, values = ~Churn, type = "pie", width = 600, height = 400) %>%
layout(title = 'ContractType distribution in churned customers')
telco_customer_churn_cleaned_data %>%
filter(Churn==1) %>%
plot_ly(labels = ~PaperlessBilling, values = ~Churn, type = "pie", width = 600, height = 400) %>%
layout(title = 'PaperlessBilling distribution in churned customers')
telco_customer_churn_cleaned_data %>%
filter(Churn==1) %>%
plot_ly(labels = ~InternetService, values = ~Churn, type = "pie", width = 600, height = 400) %>%
layout(title = 'InternetService distribution in churned customers')
Modeling
Dummify the data so that categorical variables are converted to multiple columns with 0/1s.
telco_customer_churn_dummified <- telco_customer_churn_cleaned_data %>%
dummify() %>%
mutate(Churn_yes = as.factor(ifelse(Churn_1==1, "Yes", "No")))
Partition data into Training and Validation datasets with a 70-30 split of the full dataset.
set.seed(10)
inTrainingData <- createDataPartition(telco_customer_churn_dummified$Churn_yes, p=0.70, list=FALSE)
training.set <- telco_customer_churn_dummified[inTrainingData,]
Totalvalidation.set <- telco_customer_churn_dummified[-inTrainingData,]
Random Forest
Let’s try some mtry values and see which value gives us a lower OOB error percentage.
After repeated trials, mtry 5 was found to give the lowest OOB error of around 21% as shown below.
As we can see from the Variable Importance Plots below, the most important variables for predicting Churn are
- tenure
- TotalCharges
- MonthlyCharges
- Contract_Month.to.month
- InternetServuce_Fiber.optic
- PaymentMethod.Electronic.check
which is pretty much consistent with what we found in our EDA.
training.set <- training.set %>%
select(-customerID, -Churn_0, -Churn_1)
#rf_fit_check <- randomForest(Churn_yes ~., data = training.set, importance = TRUE, metric = "ROC", ntree = 1001, mtry = 2)
#rf_fit_check <- randomForest(Churn_yes ~., data = training.set, importance = TRUE, metric = "ROC", ntree = 1001, mtry = 3)
#rf_fit_check <- randomForest(Churn_yes ~., data = training.set, importance = TRUE, metric = "ROC", ntree = 1001, mtry = 4)
rf_fit_check <- randomForest(Churn_yes ~., data = training.set, importance = TRUE, metric = "roc", ntree = 1001, mtry = 5)
VariableImp
Call:
randomForest(formula = Churn_yes ~ ., data = training.set, importance = TRUE, metric = "ROC", ntree = 1001, mtry = 5)
Type of random forest: classification
Number of trees: 1001
No. of variables tried at each split: 5
OOB estimate of error rate: 20.79%
Confusion matrix:
No Yes class.error
No 3242 380 0.1049144
Yes 645 664 0.4927426
varImpPlot(VariableImp)

Let’s use these variable and build a Random Forest model.
rf_fit <- train(Churn_yes ~ tenure+
MonthlyCharges+
TotalCharges+
Contract_Month.to.month+
InternetService_Fiber.optic,
data = training.set,
method = "rf",
metric = "Roc",
tunegrid=tunegrid,
trControl=control,
preProcess = c("center", "scale"),
ntree=1001)
The metric "Roc" was not in the result set. ROC will be used instead.
Let’s look at the model stistics of the Random Forest model.
As we can see the best ROC was achieved at mtry = 2. The model has high number of false negatives (number of churned customers who were classified as Not Churned). Let’s see if Logistic Regression can give us better predictions in the next section.
print(rf_fit)
Random Forest
4931 samples
5 predictor
2 classes: 'No', 'Yes'
Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 4438, 4437, 4438, 4437, 4439, 4438, ...
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.8205527 0.9008828 0.4702877
3 0.8012674 0.8728120 0.4937385
5 0.7946440 0.8726291 0.4825328
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
plot(rf_fit)

pred_test <- predict(rf_fit,newdata = Totalvalidation.set)
confusionMatrix(data=pred_test, Totalvalidation.set$Churn_yes)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1417 305
Yes 135 255
Accuracy : 0.7917
95% CI : (0.7737, 0.8088)
No Information Rate : 0.7348
P-Value [Acc > NIR] : 7.793e-10
Kappa : 0.408
Mcnemar's Test P-Value : 7.834e-16
Sensitivity : 0.9130
Specificity : 0.4554
Pos Pred Value : 0.8229
Neg Pred Value : 0.6538
Prevalence : 0.7348
Detection Rate : 0.6709
Detection Prevalence : 0.8153
Balanced Accuracy : 0.6842
'Positive' Class : No
results_1 <- confusionMatrix(data=pred_test, Totalvalidation.set$Churn_yes)
confusion_mat1 <- as.data.frame(results_1$table)
ggplot(data = confusion_mat1, mapping = aes(x = Reference, y = Prediction)) +
ggtitle("Confusion Matrix for Random Forest") +
geom_tile(aes(fill = Freq), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
scale_fill_gradient(low = "#B2DBD5", high = "#2B616D") +
theme_bw() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5, face = "bold"))

Logistic Regression
Let’s try to find out which predictors are statistically significant for a GLM model.
As we can see, the following variables are statistically significant in this model:
- tenure
- totalCharges
- Contract_Month.to.month
- PaperlessBilling_No
- PaymentMethod_Electronic.check
as they have low P-values. This is almost similar to what we found in Random Forest, except that MonthlyCharges has not been found statistically significant here.
summary(glm_fit_check)
Call:
glm(formula = Churn_yes ~ ., family = "binomial", data = training.set)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9221 -0.6631 -0.2966 0.7216 3.3557
Coefficients: (16 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.462e+00 2.480e+00 0.589 0.555573
tenure -5.652e-02 7.249e-03 -7.797 6.32e-15 ***
MonthlyCharges -6.236e-02 3.798e-02 -1.642 0.100580
TotalCharges 3.005e-04 8.235e-05 3.649 0.000263 ***
gender_Female 4.064e-03 7.763e-02 0.052 0.958253
gender_Male NA NA NA NA
SeniorCitizen_No -2.301e-01 1.017e-01 -2.262 0.023702 *
SeniorCitizen_Yes NA NA NA NA
Partner_No 2.341e-02 9.347e-02 0.250 0.802275
Partner_Yes NA NA NA NA
Dependents_No 5.765e-02 1.067e-01 0.540 0.588970
Dependents_Yes NA NA NA NA
PhoneService_No -6.026e-01 7.765e-01 -0.776 0.437748
PhoneService_Yes NA NA NA NA
MultipleLines_No -5.422e-01 2.130e-01 -2.546 0.010894 *
MultipleLines_Yes NA NA NA NA
InternetService_DSL 2.276e+00 9.662e-01 2.355 0.018514 *
InternetService_Fiber.optic 4.616e+00 1.907e+00 2.421 0.015486 *
InternetService_No NA NA NA NA
OnlineSecurity_No 9.031e-02 2.135e-01 0.423 0.672335
OnlineSecurity_Yes NA NA NA NA
OnlineBackup_No -1.312e-01 2.098e-01 -0.625 0.531761
OnlineBackup_Yes NA NA NA NA
DeviceProtection_No -2.724e-01 2.099e-01 -1.298 0.194367
DeviceProtection_Yes NA NA NA NA
TechSupport_No 1.323e-01 2.148e-01 0.616 0.537858
TechSupport_Yes NA NA NA NA
StreamingTV_No -7.836e-01 3.912e-01 -2.003 0.045192 *
StreamingTV_Yes NA NA NA NA
StreamingMovies_No -7.989e-01 3.905e-01 -2.046 0.040777 *
StreamingMovies_Yes NA NA NA NA
Contract_Month.to.month 1.271e+00 1.989e-01 6.391 1.64e-10 ***
Contract_One.year 6.094e-01 2.002e-01 3.044 0.002335 **
Contract_Two.year NA NA NA NA
PaperlessBilling_No -4.228e-01 8.977e-02 -4.710 2.48e-06 ***
PaperlessBilling_Yes NA NA NA NA
PaymentMethod_Bank.transfer..automatic. 7.921e-02 1.381e-01 0.573 0.566389
PaymentMethod_Credit.card..automatic. -4.723e-02 1.405e-01 -0.336 0.736698
PaymentMethod_Electronic.check 4.104e-01 1.157e-01 3.549 0.000387 ***
PaymentMethod_Mailed.check NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5707.1 on 4930 degrees of freedom
Residual deviance: 4085.4 on 4907 degrees of freedom
AIC: 4133.4
Number of Fisher Scoring iterations: 6
Let;s build a regression model using these variables and check how they predict churn.
training.set.2 <- training.set %>%
mutate(Churn_yes = as.factor(ifelse(Churn_yes=="Yes",1,0)))
Totalvalidation.set.2 <- Totalvalidation.set %>%
mutate(Churn_yes = as.factor(ifelse(Churn_yes=="Yes",1,0)))
control_reg <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(10)
reg_fit <- train(Churn_yes ~ tenure+
TotalCharges+
Contract_Month.to.month+
PaperlessBilling_No+
PaymentMethod_Electronic.check+
Contract_One.year,
data = training.set.2,
method = "glm",
family = "binomial",
preProcess = c("center","scale"),
trControl = control_reg)
As we can see the accuracy level has dropped to 77%.
print(reg_fit)
Generalized Linear Model
4931 samples
6 predictor
2 classes: '0', '1'
Pre-processing: centered (6), scaled (6)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 4437, 4438, 4437, 4438, 4439, 4438, ...
Resampling results:
Accuracy Kappa
0.7838158 0.4029077
pred_test_reg <- predict(reg_fit,newdata = Totalvalidation.set.2)
confusionMatrix(data=pred_test_reg, Totalvalidation.set.2$Churn_yes)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1383 307
1 169 253
Accuracy : 0.7746
95% CI : (0.7562, 0.7923)
No Information Rate : 0.7348
P-Value [Acc > NIR] : 1.437e-05
Kappa : 0.3722
Mcnemar's Test P-Value : 3.399e-10
Sensitivity : 0.8911
Specificity : 0.4518
Pos Pred Value : 0.8183
Neg Pred Value : 0.5995
Prevalence : 0.7348
Detection Rate : 0.6548
Detection Prevalence : 0.8002
Balanced Accuracy : 0.6714
'Positive' Class : 0
results_2 <- confusionMatrix(data=pred_test_reg, Totalvalidation.set.2$Churn_yes)
confusion_mat2 <- as.data.frame(results_2$table)
ggplot(data = confusion_mat2, mapping = aes(x = Reference, y = Prediction)) +
ggtitle("Confusion Matrix for Regression Iteration 1") +
geom_tile(aes(fill = Freq), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
scale_fill_gradient(low = "#B2DBD5", high = "#2B616D") +
theme_bw() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5, face = "bold"))

Let’s try to run regression using the same variables that we used in Random Forest and check the results
control_reg_2 <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(10)
reg_fit_2 <- train(Churn_yes ~ tenure+
MonthlyCharges+
TotalCharges+
Contract_Month.to.month+
InternetService_Fiber.optic,
data = training.set.2,
method = "glm",
family = "binomial",
preProcess = c("center","scale"),
trControl = control_reg)
As we can see the accuracy level is still 77%. And the false negatives are still on the higher side.
print(reg_fit_2)
Generalized Linear Model
4931 samples
5 predictor
2 classes: '0', '1'
Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 4438, 4439, 4438, 4438, 4438, 4438, ...
Resampling results:
Accuracy Kappa
0.7864541 0.4129474
pred_test_reg_2 <- predict(reg_fit_2,newdata = Totalvalidation.set.2)
confusionMatrix(data=pred_test_reg, Totalvalidation.set.2$Churn_yes)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1383 307
1 169 253
Accuracy : 0.7746
95% CI : (0.7562, 0.7923)
No Information Rate : 0.7348
P-Value [Acc > NIR] : 1.437e-05
Kappa : 0.3722
Mcnemar's Test P-Value : 3.399e-10
Sensitivity : 0.8911
Specificity : 0.4518
Pos Pred Value : 0.8183
Neg Pred Value : 0.5995
Prevalence : 0.7348
Detection Rate : 0.6548
Detection Prevalence : 0.8002
Balanced Accuracy : 0.6714
'Positive' Class : 0
results_3 <- confusionMatrix(data=pred_test_reg, Totalvalidation.set.2$Churn_yes)
confusion_mat3 <- as.data.frame(results_3$table)
ggplot(data = confusion_mat3, mapping = aes(x = Reference, y = Prediction)) +
ggtitle("Confusion Matrix for Regression Iteration 2") +
geom_tile(aes(fill = Freq), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
scale_fill_gradient(low = "#B2DBD5", high = "#2B616D") +
theme_bw() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5, face = "bold"))

---
title: "Assignment3 - Predict Customer Churn"
output: html_notebook
---

###Introduction

---

The aim of this analysis is to predict customer churn in a Telecom company. Two classification models will be used to predict the likelyhood of customer churn. The two models will be compared based on their accuracy.
```{r include = FALSE, message=FALSE}
#The R libraries listed below are used for this analysis. This block loads the libraries and reads data from the file
library(tidyverse)
library(ggplot2)
library(DataExplorer)
library(plotly)
library(caret)
library(randomForest)
library(gridExtra)
#read data file from project data folder present in project working directory
telco_customer_churn_raw_data <- read_csv("data\\WA_Fn-UseC_-Telco-Customer-Churn.csv")
```

###Data Set

---

The data set consists of 7,043 observations of 21 variables as shown below: <br>

* **customerID:** Unique identifier for each customer
* **gender:** Gender of customer
* **SeniorCitizen:** A binary variable which identifies if a customer is a Senior Citizen (1) or not (0)
* **Partner:** A binary variable which identifies if a customer has a partner (Yes/No)
* **Dependents:** A binary variable which identifies if a customer has any dependents (Yes/No)
* **tenure:** The tenure of association of the customer with the company in months
* **PhoneService:** A binary variable which identifies if a customer has phone service
* **MultipleLines:** Variable that identfies if customers having phone lines have multiple lines or not
* **InternetService:** This variable identifies the type of internet service that the customer has
* **OnlineSecurity:** Variable that identfies if customers having internet service have online security or not
* **Onlinebackup:** Variable that identfies if customers having internet service have online backup or not
* **DeviceProtection:** Variable that identfies if customers having internet service have device protection or not
* **TechSupport:** Variable that identfies if customers having internet service have availed tech support or not
* **StreamingTV:** Variable that identfies if customers having internet service have TV Streaming or not
* **StreamingMovies:** Variable that identfies if customers having internet service have Movie Streaming or not
* **Contract:** Variable that identfies the contract type of the customers
* **PaperlessBilling:** Variable that identfies if the customer has opted for paperless billing
* **PaymentMethod:** Variable that identfies the payment method of the customer
* **MonthlyCharges:** Variable that identfies the monthly bill amount of the customer
* **TotalCharges:** Variable that identfies the overall bill amount of the customer throughout the tenure
* **Churn:** Variable that identfies if the customer has cancelled the company's services<br>

Let's look at the distinct values present in each categorical variable.

Some factor variables have 3 types, let's see what those values are, and if we can group them in anyway.
```{r message=FALSE}
options(width = 10)
telco_customer_churn_raw_data %>%
     summarise_all(funs(n_distinct(.)))

as.data.frame(table(telco_customer_churn_raw_data$MultipleLines))
as.data.frame(table(telco_customer_churn_raw_data$InternetService))
as.data.frame(table(telco_customer_churn_raw_data$OnlineSecurity))
as.data.frame(table(telco_customer_churn_raw_data$OnlineBackup))
as.data.frame(table(telco_customer_churn_raw_data$DeviceProtection))
as.data.frame(table(telco_customer_churn_raw_data$TechSupport))
as.data.frame(table(telco_customer_churn_raw_data$StreamingMovies))
as.data.frame(table(telco_customer_churn_raw_data$StreamingTV))
```      
As we can see, some of the factors have 3 values, but the categories "No Internet Service"/"No Phone Service" are not adding any additional value as that information is already captured in a separate variable. So, let's update these two values to "No".

Also, since all categorical variables are in yes/no format except Senior Citizen, we can transform that also into Yes/No format. Let's also transform the output variable Churn into 1/0 format so that we do not have to dummify it further.
```{r message=FALSE}
#Transform MultipleLines Variable
telco_customer_churn_transformed_data <- telco_customer_churn_raw_data %>%
     mutate(MultipleLines = ifelse(MultipleLines=="No phone service","No",MultipleLines))
#Transform OnlineSecurity Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
     mutate(OnlineSecurity = ifelse(OnlineSecurity=="No internet service","No",OnlineSecurity))
#Transform OnlilneBackup Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
     mutate(OnlineBackup = ifelse(OnlineBackup=="No internet service","No",OnlineBackup))
#Transform DeviceProtection Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
     mutate(DeviceProtection = ifelse(DeviceProtection=="No internet service","No",DeviceProtection))
#Transform TechSupport Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
     mutate(TechSupport = ifelse(TechSupport=="No internet service","No",TechSupport))
#Transform Streaming Movies Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
     mutate(StreamingMovies = ifelse(StreamingMovies=="No internet service","No",StreamingMovies))
#Transform Streaming TV Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
     mutate(StreamingTV = ifelse(StreamingTV=="No internet service","No",StreamingTV))
#Transform Senior Citizen Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
     mutate(SeniorCitizen = ifelse(SeniorCitizen==1,"Yes","No"))
#Transform Churn Variable
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
     mutate(Churn = ifelse(Churn=="Yes",1,0))

```

Let's look at the structure of the dataset.
```{r fig.width=8,  message=FALSE}
telco_customer_churn_transformed_data %>%
     plot_str()
```
From the above graph we can notice that some categorical variables are currently 'characters'. We can convert them to factors for more accurate representation.
```{r  message=FALSE}
telco_customer_churn_transformed_data <- telco_customer_churn_transformed_data %>%
     mutate(SeniorCitizen = as.factor(SeniorCitizen),
            gender = as.factor(gender),
            Partner = as.factor(Partner),
            Dependents = as.factor(Dependents),
            SeniorCitizen = as.factor(SeniorCitizen),
            PhoneService = as.factor(PhoneService),
            MultipleLines = as.factor(MultipleLines),
            InternetService = as.factor(InternetService),
            OnlineSecurity = as.factor(OnlineSecurity),
            OnlineBackup = as.factor(OnlineBackup),
            DeviceProtection = as.factor(DeviceProtection),
            TechSupport = as.factor(TechSupport),
            StreamingTV = as.factor(StreamingTV),
            StreamingMovies = as.factor(StreamingMovies),
            PaperlessBilling = as.factor(PaperlessBilling),
            Contract = as.factor(Contract),
            PaymentMethod = as.factor(PaymentMethod),
            Churn = as.factor(Churn))
```

###Data Exploration

---

As the graph below shows, 99.8% of the observations are complete.

```{r message=FALSE}
telco_customer_churn_transformed_data %>%
     plot_intro()
```
The below graph shows us that "TotalCharges" has missing values. Though they are very few, we will check the reason behind this in further sections.
```{r message=FALSE}
telco_customer_churn_transformed_data %>%
     plot_missing(title = "Missing values in variables")
```

From the below count we can see that there are no duplicate Customer IDs in the dataset. There are 7043 records and 7043 unique Customer IDs
```{r}
telco_customer_churn_transformed_data %>%
     summarise(Total_Num_Records = n(),Unique_CustomerIDs = n_distinct(customerID))
```

From the below frequency distributions we can see that there are no duplicate values in any categorical variable. The data appears to be clean.
```{r fig.width=10, message=FALSE}
#Customer ID related information has already been check above. It is not required here.
telco_customer_churn_transformed_data %>%
     select(-customerID) %>%
     plot_bar(ggtheme = theme_light(), title = "Categorical Variable Distributions")

?plot_bar
```

From the below histograms we can see that there are some customers with a tenure of 0 months. We assume that these are new customers, added in the past month.

Also, notice that Total Charges is skewed to the right, which makes sense as we have customers with high tenure. The total charges for them will have accumulated for a long period.
```{r fig.width=6, fig.height=6}
telco_customer_churn_transformed_data %>%
     plot_histogram(ggtheme = theme_light())
```

The missing values of "TotalCharges" that we noticed in earlier sections is because of the new customers (tenure=0). Since they have not completed a billing cycle yet, their Total Charge value is missing. We can confirm this with the result below, which checks if there are any missing TotalCharges values for observations with non-zero tenures. As we can see there are none. So, we can conclude that missing TotalCharges are because of 0 tenure.
```{r}
telco_customer_churn_transformed_data %>%
     filter(is.na(TotalCharges) && tenure!=0)
```

Let's convert missing TotalCharges to 0 and plot missing values again.
```{r}
#Check if TotalCharges is NA and replace with 0
telco_customer_churn_cleaned_data <- telco_customer_churn_transformed_data %>%
     mutate(TotalCharges = ifelse(is.na(TotalCharges),0,TotalCharges))

telco_customer_churn_cleaned_data %>%
     plot_missing()
```
Before proceeding to modeling, let's check if there are any trends in the data. 

The overall categorical variables in this data set can be classified as follows:

* variables that define attributes of a customer
* variables that define attributes of the service

Let's check how these variables affect Churn by plotting them

**Customer Atrributes**

---

From the graphs below we can notice that:

* Gender does not have a clear trend for Churn; the gender split of churned customers is nearly 50-50.
* Senior Citizens and customers with dependents seems to have lower churn, we can check the effect of these variable on churn further.
```{r}
#using plotly here just to explore the functinalities in this package more
#Plot by SeniorCitizen
telco_customer_churn_cleaned_data %>%
     group_by(Churn,SeniorCitizen) %>%
     summarise(count = n()) %>%
     plot_ly(x= ~SeniorCitizen, y= ~count, name= ~Churn, type = 'bar', width = 400, height = 400) %>%
     layout(title = 'Senior Citizen distribution based on Churn')

#Plot by Gender
telco_customer_churn_cleaned_data %>%
     group_by(Churn,gender) %>%
     summarise(count = n()) %>%
     plot_ly(x= ~gender, y= ~count, name= ~Churn, type = 'bar', width = 400, height = 400) %>%
     layout(title = 'Gender distribution based on Churn')

#Plot by Dependents
telco_customer_churn_cleaned_data %>%
     group_by(Churn,Dependents) %>%
     summarise(count = n()) %>%
     plot_ly(x= ~Dependents, y= ~count, name= ~Churn, type = 'bar', width = 400, height = 400) %>%
     layout(title = 'Dependent distribution based on Churn')
    
#Plot by Partner
telco_customer_churn_cleaned_data %>%
     group_by(Churn,Partner) %>%
     summarise(count = n()) %>%
     plot_ly(x= ~Partner, y= ~count, name= ~Churn, type = 'bar', width = 400, height = 400) %>%
     layout(title = 'Partner distribution based on Churn')

```

**Service Atrributes**

---

From the plots below we can infer the following:

* Customers with Fiber Optic Internet Sevice seems to churn more
* Customer with Month-to-month contracts seems to churn more
* Customers who pay their bills through Electronic Check seem to churn more
* Customers who have opted for paperless bills seem to churn more

```{r fig.width=15, fig.height=10}
#using ggplot here just to explore the functinalities in this package more
#Plot by SeniorCitizen
p1 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,PhoneService) %>%
     summarise(count = n()) %>%
     ggplot(aes(PhoneService,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("PhoneService distribution") +
    labs(
        x = "Phone Service",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by Gender
p2 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,MultipleLines) %>%
     summarise(count = n()) %>%
     ggplot(aes(MultipleLines,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("MultipleLines distribution") +
    labs(
        x = "Multiple Lines",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by Dependents
p3 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,InternetService) %>%
     summarise(count = n()) %>%
     ggplot(aes(InternetService,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("InternetService distribution") +
    labs(
        x = "Internet Service",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by OnlineSecurity
p4 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,OnlineSecurity) %>%
     summarise(count = n()) %>%
     ggplot(aes(OnlineSecurity,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("OnlineSecurity distribution") +
    labs(
        x = "Online Security",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by OnlineBackup
p5 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,OnlineBackup) %>%
     summarise(count = n()) %>%
     ggplot(aes(OnlineBackup,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("OnlineBackup distribution") +
    labs(
        x = "Online Backup",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by DeviceProtection
p6 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,DeviceProtection) %>%
     summarise(count = n()) %>%
     ggplot(aes(DeviceProtection,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("DeviceProtection distribution") +
    labs(
        x = "Device Protection",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by TechSupport
p7 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,TechSupport) %>%
     summarise(count = n()) %>%
     ggplot(aes(TechSupport,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("TechSupport distribution") +
    labs(
        x = "Tech Support",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by StreamingTV
p8 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,StreamingTV) %>%
     summarise(count = n()) %>%
     ggplot(aes(StreamingTV,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("StreamingTV distribution") +
    labs(
        x = "StreamingTV",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by StreamingMovies
p9 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,StreamingMovies) %>%
     summarise(count = n()) %>%
     ggplot(aes(StreamingMovies,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("StreamingMovies distribution") +
    labs(
        x = "Streaming Movies",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by PaperlessBilling
p10 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,PaperlessBilling) %>%
     summarise(count = n()) %>%
     ggplot(aes(PaperlessBilling,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("PaperlessBilling distribution") +
    labs(
        x = "Paperless Billing",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by Contract
p11 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,Contract) %>%
     summarise(count = n()) %>%
     ggplot(aes(Contract,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("Contract Type distribution") +
    labs(
        x = "Contract Type",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )
#Plot by PaymentMethod
p12 <- telco_customer_churn_cleaned_data %>%
     group_by(Churn,PaymentMethod) %>%
     summarise(count = n()) %>%
     ggplot(aes(PaymentMethod,count,fill = Churn))+
     geom_bar(stat = "identity", position = "dodge")+
     coord_flip()+
     ggtitle("PaymentMethod distribution") +
    labs(
        x = "Payment Method",
        y = "Frequency"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(
            fill = "white",
            colour = "lightgrey",
            size = 0.75),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_line(colour = "#C4DFE6", size = 0.05)
    )

grid.arrange(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12)
```

**Other Attributes**

---

Let's look at the following variables in this section:

* Tenure
* Monthly Charges
* Overall Charges

We can notice that customers who churned were predominantly from lower tenure groups. So, this could be an important variable for our modeling. Overallcharges also has the same distribution with respect to churn, but that could be because it has strong correlation with tenure (OverallCharges = Tenure x MonthlyCharges).

```{r fig.height=10}
#Tenure
telco_customer_churn_cleaned_data %>%
      plot_ly(x= ~tenure, name = ~Churn, type = 'histogram', showlegend = T, width = 500, height = 400) %>%
     layout(title = 'Tenure distribution by Churn')

#Monthly Charges
telco_customer_churn_cleaned_data %>%
      plot_ly(x= ~MonthlyCharges, name = ~Churn, type = 'histogram', showlegend = T, width = 500, height = 400) %>%
     layout(title = 'Monthly Charges distribution by Churn')

#Overall Charges
telco_customer_churn_cleaned_data %>%
      plot_ly(x= ~TotalCharges, name = ~Churn, type = 'histogram', showlegend = T, width = 500, height = 400) %>%
     layout(title = 'Total Charges distribution by Churn')

```

By plotting Tenure against MonthlyCharges, we can notice in the below graph that churned customers are mostly concentrated in the low tenure + high monthly charges region. These two variables could be useful in our model.
```{r message=FALSE}
telco_customer_churn_cleaned_data %>%
     plot_ly(x = ~tenure, y = ~MonthlyCharges, name = ~Churn, width = 1000, height = 500) %>%
     layout(title = 'Monthly Charges vs Tenure by Churn')
```
By plotting Tenure against TotalCharges, we can notice in the below graph that churned customers are mostly concentrated in the low tenure + low TotalCharges region.

```{r message=FALSE}
telco_customer_churn_cleaned_data %>%
     plot_ly(x = ~tenure, y = ~TotalCharges, name = ~Churn, width = 1000, height = 500) %>%
     layout(title = 'Total Charges vs Tenure by Churn')
```
Plotting Monthly Charges against TotalCharges may not be very useful, but let's see if the plot can give us some insight.
As we can see churned customers are in the low-side of TotalCharges (these could be customers with low tenures).
```{r message=FALSE}
telco_customer_churn_cleaned_data %>%
     plot_ly(x = ~MonthlyCharges, y = ~TotalCharges, name = ~Churn, width = 1000, height = 500) %>%
     layout(title = 'Monthly Charges vs Total Charges by Churn')
```
Let's plot the distribution of categorical variables specifically in **churned customers** to check if any variable stands out.

**Customer Attributes distribution on churned customers**

---

We saw earlier that Senior Citizen and Dependent variables could have correlation with Churn, let's see what is the distribution of these variables in churned customers. As we can see, about 25% of the churned customers are senior citizens and 17.4 percent are those with dependents. So, these attributes may not be as useful as we thought they are afterall.

```{r}
telco_customer_churn_cleaned_data %>%
      filter(Churn==1) %>%
      plot_ly(labels = ~SeniorCitizen, values = ~Churn, type = "pie", width = 500, height = 400) %>%
      layout(title = 'SeniorCitizens distribution in Churned Customers')

telco_customer_churn_cleaned_data %>%
      filter(Churn==1) %>%
      plot_ly(labels = ~Dependents, values = ~Churn, type = "pie", width = 600, height = 400) %>%
      layout(title = 'Dependent distribution in churned customers')
```

**Service Attributes distribution on churned customers**

---

We saw earlier that Payment Method,  Contract Type, Paperless Billing and Internet Service Type attributes could have correlation with Churn, let's see what is the distribution of these variables in churned customers.

As we can notice from the graphs below, all the 4 variables have a specific category that contributes to more than 50% of churned customers. These variables could be very useful for our models.
```{r fig.align='center'}
telco_customer_churn_cleaned_data %>%
      filter(Churn==1) %>%
      plot_ly(labels = ~PaymentMethod, values = ~Churn, type = "pie", width = 600, height = 400) %>%
      layout(title = 'Payment Method distribution in Churned Customers')

telco_customer_churn_cleaned_data %>%
      filter(Churn==1) %>%
      plot_ly(labels = ~Contract, values = ~Churn, type = "pie", width = 600, height = 400) %>%
      layout(title = 'ContractType distribution in churned customers')

telco_customer_churn_cleaned_data %>%
      filter(Churn==1) %>%
      plot_ly(labels = ~PaperlessBilling, values = ~Churn, type = "pie", width = 600, height = 400) %>%
      layout(title = 'PaperlessBilling distribution in churned customers')

telco_customer_churn_cleaned_data %>%
      filter(Churn==1) %>%
      plot_ly(labels = ~InternetService, values = ~Churn, type = "pie", width = 600, height = 400) %>%
      layout(title = 'InternetService distribution in churned customers')
```
###Results of EDA

---

Based on the trends we noticed in EDA, we found that the following variables could be very important to predict customer churn

* Payment Method (ElectronicCheck)
* PaperlessBilling
* ContractType (Month-to-Month)
* InternetService
* Tenure
* MonthlyCharges

Notice that these are all service attributes and **not** customer attributes. May be the company must look at service attributes to control churn more than customemr attributes. This could be a good general guiding principle for the company. 
Let's proceed to modeling now and use these attributes.

###Modeling

---

Dummify the data so that categorical variables are converted to multiple columns with 0/1s.
```{r message=FALSE}
telco_customer_churn_dummified <- telco_customer_churn_cleaned_data %>%
                                        dummify() %>%
                                        mutate(Churn_yes = as.factor(ifelse(Churn_1==1, "Yes", "No")))
```

Partition data into Training and Validation datasets with a 70-30 split of the full dataset.
```{r message=FALSE}
set.seed(10)

inTrainingData <- createDataPartition(telco_customer_churn_dummified$Churn_yes, p=0.70, list=FALSE)

training.set <- telco_customer_churn_dummified[inTrainingData,]

Totalvalidation.set <- telco_customer_churn_dummified[-inTrainingData,]

```

####Random Forest

---

Let's try some mtry values and see which value gives us a lower OOB error percentage. 

After repeated trials, mtry 5 was found to give the lowest OOB error of around 21% as shown below.

As we can see from the **Variable Importance Plots** below, the most important variables for predicting Churn are

* tenure
* TotalCharges
* MonthlyCharges
* Contract_Month.to.month
* InternetServuce_Fiber.optic
* PaymentMethod.Electronic.check

which is pretty much consistent with what we found in our EDA.

```{r fig.height=10, message=FALSE}
training.set <- training.set %>%
                    select(-customerID, -Churn_0, -Churn_1) 

#rf_fit_check <- randomForest(Churn_yes ~., data = training.set, importance = TRUE, metric = "ROC", ntree = 1001, mtry = 2)
#rf_fit_check <- randomForest(Churn_yes ~., data = training.set, importance = TRUE, metric = "ROC", ntree = 1001, mtry = 3)
#rf_fit_check <- randomForest(Churn_yes ~., data = training.set, importance = TRUE, metric = "ROC", ntree = 1001, mtry = 4)
rf_fit_check <- randomForest(Churn_yes ~., data = training.set, importance = TRUE, metric = "roc", ntree = 1001, mtry = 5)

VariableImp
varImpPlot(VariableImp)
```

Let's use these variable and build a Random Forest model.


```{r Random Forest}
control <- trainControl(method = "repeatedcv", number = 10, repeats=3, search = "grid", summaryFunction=twoClassSummary, classProbs = TRUE)
set.seed(10)
tunegrid <- expand.grid(.mtry=5)
rf_fit <- train(Churn_yes ~ tenure+
                     MonthlyCharges+
                     TotalCharges+
                     Contract_Month.to.month+
                     InternetService_Fiber.optic,
                data = training.set,
                method = "rf",
                metric = "ROC",
                tunegrid=tunegrid,
                trControl=control,
                preProcess = c("center", "scale"),
                ntree=1001)
```

Let's look at the model stistics of the Random Forest model. 

As we can see the best ROC was achieved at mtry = 2. The model has high number of false negatives (number of churned customers who were classified as Not Churned). Let's see if Logistic Regression can give us better predictions in the next section.
```{r}
print(rf_fit)

plot(rf_fit)


pred_test <- predict(rf_fit,newdata = Totalvalidation.set)

confusionMatrix(data=pred_test, Totalvalidation.set$Churn_yes)

results_1 <- confusionMatrix(data=pred_test, Totalvalidation.set$Churn_yes)
confusion_mat1 <- as.data.frame(results_1$table)

ggplot(data =  confusion_mat1, mapping = aes(x = Reference, y = Prediction)) +
  ggtitle("Confusion Matrix for Random Forest") +
  geom_tile(aes(fill = Freq), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "#B2DBD5", high = "#2B616D") +
  theme_bw() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5, face = "bold"))

```

####Logistic Regression

---

Let's try to find out which predictors are statistically significant for a GLM model.

As we can see, the following variables are statistically significant in this model:

* tenure
* totalCharges
* Contract_Month.to.month
* PaperlessBilling_No
* PaymentMethod_Electronic.check

as they have low P-values. This is almost similar to what we found in Random Forest, except that MonthlyCharges has not been found statistically significant here.

```{r}
set.seed(10)

glm_fit_check <- glm(Churn_yes ~., data = training.set, family = 'binomial')

summary(glm_fit_check)
```
Let;s build a regression model using these variables and check how they predict churn.
```{r}
training.set.2 <- training.set %>%
                    mutate(Churn_yes = as.factor(ifelse(Churn_yes=="Yes",1,0)))
Totalvalidation.set.2 <- Totalvalidation.set %>%
                    mutate(Churn_yes = as.factor(ifelse(Churn_yes=="Yes",1,0)))

control_reg <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(10)
reg_fit <- train(Churn_yes ~ tenure+
                             TotalCharges+
                             Contract_Month.to.month+
                             PaperlessBilling_No+
                             PaymentMethod_Electronic.check+
                             Contract_One.year,
                         data = training.set.2,
                         method = "glm",
                         family = "binomial",
                         preProcess = c("center","scale"),
                         trControl = control_reg)
```

As we can see the accuracy level has dropped to 77%.
```{r}
print(reg_fit)

pred_test_reg <- predict(reg_fit,newdata = Totalvalidation.set.2)

confusionMatrix(data=pred_test_reg, Totalvalidation.set.2$Churn_yes)

results_2 <- confusionMatrix(data=pred_test_reg, Totalvalidation.set.2$Churn_yes)
confusion_mat2 <- as.data.frame(results_2$table)

ggplot(data =  confusion_mat2, mapping = aes(x = Reference, y = Prediction)) +
  ggtitle("Confusion Matrix for Regression Iteration 1") +
  geom_tile(aes(fill = Freq), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "#B2DBD5", high = "#2B616D") +
  theme_bw() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5, face = "bold"))

```

Let's try to run regression using the same variables that we used in Random Forest and check the results
```{r}
control_reg_2 <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(10)
reg_fit_2 <- train(Churn_yes ~ tenure+
                               MonthlyCharges+
                               TotalCharges+
                               Contract_Month.to.month+
                               InternetService_Fiber.optic,
                         data = training.set.2,
                         method = "glm",
                         family = "binomial",
                         preProcess = c("center","scale"),
                         trControl = control_reg_2)

```

As we can see the accuracy level is still 77%. And the false negatives are still on the higher side.
```{r}
print(reg_fit_2)

pred_test_reg_2 <- predict(reg_fit_2,newdata = Totalvalidation.set.2)

confusionMatrix(data=pred_test_reg, Totalvalidation.set.2$Churn_yes)

results_3 <- confusionMatrix(data=pred_test_reg, Totalvalidation.set.2$Churn_yes)
confusion_mat3 <- as.data.frame(results_3$table)

ggplot(data =  confusion_mat3, mapping = aes(x = Reference, y = Prediction)) +
  ggtitle("Confusion Matrix for Regression Iteration 2") +
  geom_tile(aes(fill = Freq), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "#B2DBD5", high = "#2B616D") +
  theme_bw() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5, face = "bold"))
```

###Conclusions

---

Between the two models used, Random Forest is the winner as it has better accuracy.

The following conclusions can be drawn from this analysis:

* Churn seems to be more affected by service attributes than customer demographic attributes.
* Customers who have been associated with the company longer seems less likely to churn.
* Customers with high monthly charges are more likely to churn.
* Customers who have Fiber Optic internet connection are more likely to churn.

